Syntactic processing of the IPI PAN Corpus of Polish

Authors

  • Adam Przepiórkowski
  • Aleksander Buczyński
  • Daniel Janus
Abstract

This paper presents recent and ongoing work on adorning the IPI PAN Corpus of Polish (Przepiórkowski 2004, 2006a) with partial syntactic annotation, with the ultimate aim of building a treebank of Polish. The work described here is part of the project "Automatic extraction of linguistic knowledge from a large corpus of Polish" (Ministry of Education and Science grant number 3T11C00328), which aims at the automatic construction of a valence dictionary.

Similar resources

Building a Morphosyntactic Lexicon and a Pre-syntactic Processing Chain for Polish

This paper introduces a new set of tools and resources for Polish which cover all the steps required to transform a raw unrestricted text into a reasonable input for a parser. This includes (1) a large-coverage morphological lexicon, developed thanks to the IPI PAN corpus as well as a lexical acquisition technique, and (2) multiple tools for spelling correction, segmentation, tokenization and na...

On Heads and Coordination in Valence Acquisition

The aim of this paper is to present the design of a partial syntactic annotation of the IPI PAN Corpus of Polish [22] and the corresponding extension of the corpus search engine Poliqarp [25,12] developed at the Institute of Computer Science PAS and currently employed in Polish and Portuguese corpora projects. In particular, we will argue for the need to distinguish between, and represent both, ...

Corpus, Medical Text, Annotation Morpho-syntactic Tagging, Natural Language Processing Corpus of Medical Texts and Tools

There is only one large corpus of Polish annotated with morpho-syntactic information, namely the IPI PAN Corpus (IPIC). This situation is a major obstacle to the creation of natural language processing tools dedicated to the domain of medical texts. However, real-life medical texts exhibit features that make them very distinct from most of the texts stored in IPIC. In the paper, the attempts...

An Implementation of Combined Partial Parser and Morphosyntactic Disambiguator

The aim of this paper is to present a simple yet efficient implementation of a tool for simultaneous rule-based morphosyntactic tagging and partial parsing formalism. The parser is currently used for creating a treebank of partial parses in a valency acquisition project over the IPI PAN Corpus of Polish.

A Rule-Based Tagger for Polish Based on Genetic Algorithm

In the paper, an approach to the construction of a rule-based morphosyntactic tagger for Polish is proposed. The core of the tagger consists of modules of rules (classification systems), acquired from the IPI PAN corpus by application of Genetic Algorithms. Each module is specialised in making decisions concerning different parts of a tag (a structure of attributes). The acquired rules are combined with l...

Publication date: 2007